[AMDGPU] always emit a soft wait even if it is trivially ~0 #147257

ssahasra · 2025-07-07T09:11:20Z

The memory legalizer is currently responsible for emitting wait instructions at ordering operations such as acquire and release. It tries to be efficient by emitting waits only when required. In particular, it does not emit a wait on vmcnt at workgroup scope since that ordering is already guaranteed by the architecture. But this is now incorrect because direct loads to LDS have an LDS component which needs explicit ordering on vmcnt. But it is inefficient to always emit a wait on vmcnt since majority of the programs do not use direct loads to LDS, and this will affect all workgroup scope operations.

As a first step to that, the memory legalizer now emits a soft wait instruction even if all counts are trivially ~0. This is a placeholder that the SIInsertWaitcnts pass will either optimize away or strengthen based on its analysis of whether direct loads to LDS are pending at this point in the program.

llvmbot · 2025-07-07T09:11:52Z

@llvm/pr-subscribers-backend-amdgpu

Author: Sameer Sahasrabuddhe (ssahasra)

Changes

The memory legalizer is currently responsible for emitting wait instructions at ordering operations such as acquire and release. It tries to be efficient by emitting waits only when required. In particular, it does not emit a wait on vmcnt at workgroup scope since that ordering is already guaranteed by the architecture. But this is now incorrect because direct loads to LDS have an LDS component which needs explicit ordering on vmcnt. But it is inefficient to always emit a wait on vmcnt since majority of the programs do not use direct loads to LDS, and this will affect all workgroup scope operations.

As a first step to that, the memory legalizer now emits a soft wait instruction even if all counts are trivially ~0. This is a placeholder that the SIInsertWaitcnts pass will either optimize away or strenghthen based on its analysis of whether direct loads to LDS are pending at this point in the program.

Patch is 4.42 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147257.diff

41 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+25-33)
(modified) llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll (+6-6)
(modified) llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+112)
(modified) llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll (+2-2)
(modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+2-2)
(modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+6-6)
(modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (+6-6)
(modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+66-66)
(modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+2)
(modified) llvm/test/CodeGen/AMDGPU/kernel-vgpr-spill-mubuf-with-voffset.ll (+1)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll (+1)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+64)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+168-6)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+220-4)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll (+160-32)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-singlethread.ll (+1420)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll (+160-32)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+14-2)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-wavefront.ll (+1410)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll (+576-68)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll (+192)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-singlethread.ll (+1152-52)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll (+168)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll (+14-1)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-wavefront.ll (+1152-52)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll (+706-82)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+940-98)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-nontemporal.ll (+3-2)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-singlethread.ll (+1548)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll (+940-98)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (+27-7)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-wavefront.ll (+1548)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll (+940-98)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local.mir (+31)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-volatile.ll (+12)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-region.mir (+31)
(modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+4-4)
(modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands.ll (+4-4)
(modified) llvm/test/CodeGen/AMDGPU/stacksave_stackrestore.ll (+6)
(modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5)
(modified) llvm/test/CodeGen/AMDGPU/wait-before-stores-with-scope_sys.mir (+1)

diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 3212060f303a5..f015d3ad7811e 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -1074,8 +1074,6 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                     SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                     bool IsCrossAddrSpaceOrdering, Position Pos,
                                     AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -1149,21 +1147,19 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx6CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -1966,8 +1962,6 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2057,28 +2051,25 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (VSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
         .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
         .addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -2287,8 +2278,6 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2372,23 +2361,26 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
       BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_SAMPLECNT_soft)).addImm(0);
     }
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft)).addImm(0);
-    Changed = true;
+  } else {
+    // Always emit a soft wait count, even if it is trivially ~0.
+    // SIInsertWaitcnts will later use this marker to add additional waits such
+    // as those required from direct load to LDS (formerly known as LDS DMA).
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft))
+        .addImm(getLoadcntBitMask(IV));
   }
 
   if (STORECnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_STORECNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (DSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_DSCNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx12CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 8a80afd4a768f..1bbbec977b714 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -880,8 +880,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -921,8 +921,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -991,8 +991,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1032,8 +1032,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; UNALIGNED_GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -1520,8 +1520,8 @@ define amdgpu_kernel void @store_load_vindex_large_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1633,8 +1633,8 @@ define amdgpu_kernel void @store_load_vindex_large_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
index 66037615f0ba0..ea6a5d5e74d52 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
@@ -199,26 +199,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acquire() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acquire
@@ -228,26 +234,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_release() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") release
@@ -257,26 +269,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acq_rel
@@ -286,26 +304,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") seq_cst
@@ -501,10 +525,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acquire
@@ -516,6 +542,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acquire
@@ -527,6 +554,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acquire
@@ -536,10 +564,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_release
@@ -550,6 +580,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_release
@@ -560,6 +591,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") release
@@ -569,10 +601,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -584,6 +618,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -595,6 +630,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acq_rel
@@ -604,10 +640,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -619,6 +657,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -630,6 +669,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") seq_cst
@@ -639,26 +679,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acquire() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") acquire
@@ -668,26 +714,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_release() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") release
@@ -697,26 +749,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ;...
[truncated]

llvmbot · 2025-07-07T09:11:53Z

@llvm/pr-subscribers-llvm-globalisel

Author: Sameer Sahasrabuddhe (ssahasra)

Changes

The memory legalizer is currently responsible for emitting wait instructions at ordering operations such as acquire and release. It tries to be efficient by emitting waits only when required. In particular, it does not emit a wait on vmcnt at workgroup scope since that ordering is already guaranteed by the architecture. But this is now incorrect because direct loads to LDS have an LDS component which needs explicit ordering on vmcnt. But it is inefficient to always emit a wait on vmcnt since majority of the programs do not use direct loads to LDS, and this will affect all workgroup scope operations.

As a first step to that, the memory legalizer now emits a soft wait instruction even if all counts are trivially ~0. This is a placeholder that the SIInsertWaitcnts pass will either optimize away or strenghthen based on its analysis of whether direct loads to LDS are pending at this point in the program.

Patch is 4.42 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/147257.diff

41 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+25-33)
(modified) llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll (+6-6)
(modified) llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+112)
(modified) llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll (+2-2)
(modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+2-2)
(modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+6-6)
(modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (+6-6)
(modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+66-66)
(modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+2)
(modified) llvm/test/CodeGen/AMDGPU/kernel-vgpr-spill-mubuf-with-voffset.ll (+1)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.update.dpp.ll (+1)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+64)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+168-6)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+220-4)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll (+160-32)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-singlethread.ll (+1420)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll (+160-32)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+14-2)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-wavefront.ll (+1410)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll (+576-68)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll (+192)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-singlethread.ll (+1152-52)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll (+168)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll (+14-1)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-wavefront.ll (+1152-52)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll (+706-82)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+940-98)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-nontemporal.ll (+3-2)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-singlethread.ll (+1548)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll (+940-98)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (+27-7)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-wavefront.ll (+1548)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll (+940-98)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local.mir (+31)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-volatile.ll (+12)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-region.mir (+31)
(modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+4-4)
(modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands.ll (+4-4)
(modified) llvm/test/CodeGen/AMDGPU/stacksave_stackrestore.ll (+6)
(modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+5)
(modified) llvm/test/CodeGen/AMDGPU/wait-before-stores-with-scope_sys.mir (+1)

diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 3212060f303a5..f015d3ad7811e 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -1074,8 +1074,6 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                     SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                     bool IsCrossAddrSpaceOrdering, Position Pos,
                                     AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -1149,21 +1147,19 @@ bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx6CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -1966,8 +1962,6 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2057,28 +2051,25 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
     }
   }
 
-  if (VMCnt || LGKMCnt) {
-    unsigned WaitCntImmediate =
-      AMDGPU::encodeWaitcnt(IV,
-                            VMCnt ? 0 : getVmcntBitMask(IV),
-                            getExpcntBitMask(IV),
-                            LGKMCnt ? 0 : getLgkmcntBitMask(IV));
-    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
-        .addImm(WaitCntImmediate);
-    Changed = true;
-  }
+  // Always emit a soft wait count, even if it is trivially ~0. SIInsertWaitcnts
+  // will later use this marker to add additional waits such as those required
+  // from direct load to LDS (formerly known as LDS DMA).
+  unsigned WaitCntImmediate = AMDGPU::encodeWaitcnt(
+      IV, VMCnt ? 0 : getVmcntBitMask(IV), getExpcntBitMask(IV),
+      LGKMCnt ? 0 : getLgkmcntBitMask(IV));
+  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_soft))
+      .addImm(WaitCntImmediate);
 
   if (VSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
         .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
         .addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -2287,8 +2278,6 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
                                      SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                      bool IsCrossAddrSpaceOrdering,
                                      Position Pos, AtomicOrdering Order) const {
-  bool Changed = false;
-
   MachineBasicBlock &MBB = *MI->getParent();
   DebugLoc DL = MI->getDebugLoc();
 
@@ -2372,23 +2361,26 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
       BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_SAMPLECNT_soft)).addImm(0);
     }
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft)).addImm(0);
-    Changed = true;
+  } else {
+    // Always emit a soft wait count, even if it is trivially ~0.
+    // SIInsertWaitcnts will later use this marker to add additional waits such
+    // as those required from direct load to LDS (formerly known as LDS DMA).
+    BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft))
+        .addImm(getLoadcntBitMask(IV));
   }
 
   if (STORECnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_STORECNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (DSCnt) {
     BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_DSCNT_soft)).addImm(0);
-    Changed = true;
   }
 
   if (Pos == Position::AFTER)
     --MI;
 
-  return Changed;
+  return true;
 }
 
 bool SIGfx12CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 8a80afd4a768f..1bbbec977b714 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -880,8 +880,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -921,8 +921,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -991,8 +991,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1032,8 +1032,8 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
 ; UNALIGNED_GFX11-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX11-NEXT:    scratch_store_b32 v0, v2, off offset:384 dlc
-; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX11-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX11-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX11-NEXT:    s_add_u32 s0, 0x100, s0
 ; UNALIGNED_GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -1520,8 +1520,8 @@ define amdgpu_kernel void @store_load_vindex_large_offset_kernel(i32 %n) {
 ; GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
@@ -1633,8 +1633,8 @@ define amdgpu_kernel void @store_load_vindex_large_offset_kernel(i32 %n) {
 ; UNALIGNED_GFX10-NEXT:    v_lshlrev_b32_e32 v1, 2, v1
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v0, 0x4004, v0
 ; UNALIGNED_GFX10-NEXT:    scratch_store_dword v0, v2, off offset:128
-; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; UNALIGNED_GFX10-NEXT:    s_waitcnt_vscnt null, 0x0
 ; UNALIGNED_GFX10-NEXT:    s_lshl_b32 s0, s0, 7
 ; UNALIGNED_GFX10-NEXT:    s_add_u32 s0, 0x4004, s0
 ; UNALIGNED_GFX10-NEXT:    v_add_nc_u32_e32 v1, s0, v1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
index 66037615f0ba0..ea6a5d5e74d52 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
@@ -199,26 +199,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acquire() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acquire
@@ -228,26 +234,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_release() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") release
@@ -257,26 +269,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") acq_rel
@@ -286,26 +304,32 @@ entry:
 define amdgpu_kernel void @singlethread_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: singlethread_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: singlethread_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: singlethread_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("singlethread-one-as") seq_cst
@@ -501,10 +525,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acquire
@@ -516,6 +542,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acquire
@@ -527,6 +554,7 @@ define amdgpu_kernel void @workgroup_one_as_acquire() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acquire
@@ -536,10 +564,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_release
@@ -550,6 +580,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_release
@@ -560,6 +591,7 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") release
@@ -569,10 +601,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -584,6 +618,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -595,6 +630,7 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") acq_rel
@@ -604,10 +640,12 @@ entry:
 define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ; GFX6-LABEL: name: workgroup_one_as_seq_cst
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: workgroup_one_as_seq_cst
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -619,6 +657,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX10CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -630,6 +669,7 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
   ;
   ; GFX11CU-LABEL: name: workgroup_one_as_seq_cst
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("workgroup-one-as") seq_cst
@@ -639,26 +679,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acquire() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acquire
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acquire
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acquire
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acquire
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acquire
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") acquire
@@ -668,26 +714,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_release() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_release
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_release
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_release
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_release
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_release
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_release
   ; GFX11CU: bb.0.entry:
+  ; GFX11CU-NEXT:   S_WAITCNT_soft 65527
   ; GFX11CU-NEXT:   S_ENDPGM 0
 entry:
   fence syncscope("wavefront-one-as") release
@@ -697,26 +749,32 @@ entry:
 define amdgpu_kernel void @wavefront_one_as_acq_rel() #0 {
   ; GFX6-LABEL: name: wavefront_one_as_acq_rel
   ; GFX6: bb.0.entry:
+  ; GFX6-NEXT:   S_WAITCNT_soft 3967
   ; GFX6-NEXT:   S_ENDPGM 0
   ;
   ; GFX8-LABEL: name: wavefront_one_as_acq_rel
   ; GFX8: bb.0.entry:
+  ; GFX8-NEXT:   S_WAITCNT_soft 3967
   ; GFX8-NEXT:   S_ENDPGM 0
   ;
   ; GFX10WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10WGP: bb.0.entry:
+  ; GFX10WGP-NEXT:   S_WAITCNT_soft 65407
   ; GFX10WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX10CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX10CU: bb.0.entry:
+  ; GFX10CU-NEXT:   S_WAITCNT_soft 65407
   ; GFX10CU-NEXT:   S_ENDPGM 0
   ;
   ; GFX11WGP-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11WGP: bb.0.entry:
+  ; GFX11WGP-NEXT:   S_WAITCNT_soft 65527
   ; GFX11WGP-NEXT:   S_ENDPGM 0
   ;
   ; GFX11CU-LABEL: name: wavefront_one_as_acq_rel
   ; GFX11CU: bb.0.entry:
+  ;...
[truncated]

ssahasra · 2025-07-07T09:15:49Z

This is part of a stack:

llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll

The memory legalizer is currently responsible for emitting wait instructions at ordering operations such as acquire and release. It tries to be efficient by emitting waits only when required. In particular, it does not emit a wait on vmcnt at workgroup scope since that ordering is already guaranteed by the architecture. But this is now incorrect because direct loads to LDS have an LDS component which needs explicit ordering on vmcnt. But it is inefficient to always emit a wait on vmcnt since majority of the programs do not use direct loads to LDS, and this will affect all workgroup scope operations. As a first step to that, the memory legalizer now emits a soft wait instruction even if all counts are trivially ~0. This is a placeholder that the SIInsertWaitcnts pass will either optimize away or strenghthen based on its analysis of whether direct loads to LDS are pending at this point in the program.

jayfoad · 2025-07-10T14:31:42Z

Something still doesn't feel right with this PR for me

Me too.

Currently S_WAITCNT_soft has exactly the same execution semantics as S_WAITCNT. (The only difference is whether we want SIInsertWaitcnts to optimize it away if it can prove that it is a no-op.) What I don't like about your patches is that you're assigning some extra meaning to S_WAITCNT_soft vmcnt(63), or to the mere existence of an S_WAITCNT_soft even if the corresponding S_WAITCNT would be a no-op. I think it would be much cleaner to use a different pseudo for that.

jayfoad · 2025-07-10T14:42:21Z

Maybe part of the disagreement is about what S_WAITCNT_soft should be. Since it is currently only used in more-or-less one place (*CacheControl::insertWait in SIMemoryLegalizer) I guess you could argue that it should be defined to do whatever that user needs it to do.

I would prefer that we keep it functionally equivalent to S_WAITCNT. I'd like to think that another pass could insert, say, S_WAITCNT_soft expcnt(0) and have SIInsertWaitcnts treat it as a wait for expcnt that can be relaxed or removed, without getting extra unwanted behaviour to do with waiting for loads-to-LDS.

ssahasra requested review from jayfoad, arsenm, kerbowa, t-tye and Pierre-vh July 7, 2025 09:11

llvmbot added backend:AMDGPU llvm:globalisel labels Jul 7, 2025

ssahasra mentioned this pull request Jul 7, 2025

[AMDGPU] auto update some tests to prepare for future changes #147256

Merged

ssahasra mentioned this pull request Jul 7, 2025

[AMDGPU] efficiently wait for direct loads to LDS at all scopes #147258

Open

Pierre-vh reviewed Jul 7, 2025

View reviewed changes

llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll Outdated Show resolved Hide resolved

Base automatically changed from users/ssahasra/waitcnt-update-tests to main July 8, 2025 09:30

ssahasra added 2 commits July 9, 2025 12:55

[AMDGPU] optimize trivial soft wait counts

e6831c6

ssahasra force-pushed the users/ssahasra/waitcnt-always-emit branch from 6e30f37 to e6831c6 Compare July 9, 2025 08:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AMDGPU] always emit a soft wait even if it is trivially ~0 #147257

[AMDGPU] always emit a soft wait even if it is trivially ~0 #147257

ssahasra commented Jul 7, 2025 •

edited by arsenm

Loading

Uh oh!

llvmbot commented Jul 7, 2025

Uh oh!

llvmbot commented Jul 7, 2025

Uh oh!

ssahasra commented Jul 7, 2025

Uh oh!

Uh oh!

jayfoad commented Jul 10, 2025

Uh oh!

jayfoad commented Jul 10, 2025

Uh oh!

Uh oh!

[AMDGPU] always emit a soft wait even if it is trivially ~0 #147257

Are you sure you want to change the base?

[AMDGPU] always emit a soft wait even if it is trivially ~0 #147257

Conversation

ssahasra commented Jul 7, 2025 • edited by arsenm Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Jul 7, 2025

Uh oh!

llvmbot commented Jul 7, 2025

Uh oh!

ssahasra commented Jul 7, 2025

Uh oh!

Uh oh!

jayfoad commented Jul 10, 2025

Uh oh!

jayfoad commented Jul 10, 2025

Uh oh!

Uh oh!

ssahasra commented Jul 7, 2025 •

edited by arsenm

Loading